Reliable, high-throughput, high-performance transport and routing mechanism for arbitrary data flows

ABSTRACT

The present invention leverages an existing content delivery network infrastructure to provide a system that enhances performance for any application that uses the Internet Protocol (IP) as its underlying transport mechanism. An overlay network comprises a set of edge nodes, intermediate nodes, and gateway nodes. This network provides optimized routing of IP packets. Internet application users can use the overlay to obtain improved performance during normal network conditions, to obtain or maintain good performance where normal default BGP routing would otherwise force the user over congested or poorly performing paths, or to enable the user to maintain communications to a target server application even during network outages.

This application is a continuation of Ser. No. 11/323,342, filed Dec. 30, 2005, now U.S. Pat. No. 7,660,296.

BACKGROUND OF THE INVENTION

1. Technical Field

The present invention relates generally to data packet transport and routing over the Internet.

2. Brief Description of the Related Art

The public Internet is increasingly being used by enterprises for a variety of mission-critical applications such as transactions for e-commerce, inter-office connectivity over virtual private networks (VPNs), and most recently, for web services as a new paradigm for developing distributed applications. The current Border Gateway Protocol (BGP) based Internet routing infrastructure, however, is inadequate to support the reliability and performance needs of these applications. In particular, Internet routing, largely determined by the BGP protocol, has several weaknesses. First, BGP uses a metric known as shortest AS (Autonomous System) path length to determine a next hop for a packet. FIG. 1A shows that BGP will route data from Network A destined for Network D directly, because the AS path length is one. This is not always desirable, as it has been shown that BGP is slow to converge. Thus, if a link between two networks becomes unavailable, it can take seconds to several minutes before all relevant routers become aware and can route around the problem. During this time, packets will be lost. Furthermore, as illustrated in FIG. 1B, peering policies may dictate that a network should not accept packets from another network; BGP cannot efficiently route around this problem. Another problem is that different Internet applications require different characteristics (e.g., minimal loss, latency, or variability in latency) of an end-to-end connection for optimal performance. BGP makes no effort to route for quality of service and has no notion of any of these metrics. As illustrated in FIG. 2, BGP will choose a route (between Networks A and D) with larger latency than alternative routes.

There is a need in the art for intelligent routing as businesses increasingly rely on the Internet for such applications as Web transactions, virtual private networks (VPNs) and Web Services. The notion of intelligent routing based on measurements of real time network conditions is known in the art, e.g., in the product offerings by RouteScience and other companies. These products, however, have the ability to control only the first hop of the outbound route, namely, by injecting appropriate directives into the router. Attempts to control the inbound route, e.g., by affecting BGP advertisements, are limited by the low frequency with which these advertisements can be changed, the coarse granularity of the advertisements, the requirement of cooperation from multiple routers on the Internet, and the ubiquity of policy overrides by several ISPs.

Distributed computer systems also are well-known in the prior art. One such distributed computer system is a “content delivery network” or “CDN” that is operated and managed by a service provider. The service provider may provide the service on its own behalf, or on behalf of third parties. A “distributed system” of this type typically refers to a collection of autonomous computers linked by a network or networks, together with the software, systems, protocols and techniques designed to facilitate various services, such as content delivery or the support of outsourced site infrastructure. Typically, “content delivery” means the storage, caching, or transmission of content, streaming media and applications on behalf of content providers, including ancillary technologies used therewith including, without limitation, request routing, provisioning, data monitoring and reporting, content targeting, personalization, and business intelligence. The term “outsourced site infrastructure” means the distributed systems and associated technologies that enable an entity to operate and/or manage a third party's Web site infrastructure, in whole or in part, on the third party's behalf.

A known distributed computer system is assumed to have a set of machines distributed around the Internet. Typically, most of the machines are servers located near the edge of the Internet, i.e., at or adjacent end user access networks. A Network Operations Command Center (NOCC) may be used to administer and manage operations of the various machines in the system. Third party sites, such as Web sites, offload delivery of content (e.g., HTML, embedded page objects, streaming media, software downloads, and the like) to the distributed computer system and, in particular, to “edge” servers. End users that desire such content may be directed to the distributed computer system to obtain that content more reliably and efficiently. Although not shown in detail, the distributed computer system may also include other infrastructure, such as a distributed data collection system that collects usage and other data from the edge servers, aggregates that data across a region or set of regions, and passes that data to other back-end systems to facilitate monitoring, logging, alerts, billing, management and other operational and administrative functions. Distributed network agents monitor the network as well as the server loads and provide network, traffic and load data to a DNS query handling mechanism, which is authoritative for content domains being managed by the CDN. A distributed data transport mechanism may be used to distribute control information (e.g., metadata to manage content, to facilitate load balancing, and the like) to the edge servers. As illustrated in FIG. 3, a given machine 300 comprises commodity hardware (e.g., an Intel Pentium processor) 302 running an operating system kernel (such as Linux or variant) 304 that supports one or more applications 306 a-n. To facilitate content delivery services, for example, given machines typically run a set of applications, such as an HTTP Web proxy 307, a name server 308, a local monitoring process 310, a distributed data collection process 312, and the like.

Content delivery networks such as described above also may include ancillary networks or mechanisms to facilitate transport of certain data or to improve data throughput. Thus, an Internet CDN may provide transport of streaming media using information dispersal techniques whereby a given stream is sent on multiple redundant paths. One such technique is described in U.S. Pat. No. 6,751,673, titled “Streaming media subscription mechanism for a content delivery network,” assigned to Akamai Technologies, Inc. The CDN may also provide transport mechanisms to facilitate communications between a pair of hosts, e.g., two CDN servers, or a CDN edge server and a customer origin server, based on performance data that has been collected over time. A representative HTTP-based technique is described in U.S. Published Patent Application 2002/0163882, titled “Optimal route selection in a content delivery network,” also assigned to Akamai Technologies, Inc.

BRIEF SUMMARY OF THE INVENTION

It is an object of the present invention to provide a reliable, high-throughput, high-performance transport and routing mechanism for arbitrary data flows.

It is another object of the invention to provide an overlay on top of the Internet that routes data around problems in the Internet to find a best service route or a path with minimal latency and loss.

It is still another object of the invention to leverage an existing content delivery network infrastructure to provide a system that enhances performance for any application that uses the Internet Protocol (IP) as its underlying transport mechanism.

Another more general object of the invention is to provide an overlay mechanism that improves the performance and reliability of business applications on the Internet.

Still another object of the invention is to provide techniques that enable Internet application users to obtain improved performance during normal network conditions, to obtain or maintain good performance where normal default BGP routing would otherwise force the user over congested or poorly performing paths, or to enable the user to continue communications even during network outages.

The present invention provides a scalable, highly available and reliable overlay service that detects poorly performing paths and routes around them, as well as finding better performing alternative paths when the direct path is functioning normally, thereby ensuring improved application performance and reliability.

The foregoing has outlined some of the more pertinent features of the invention. These features should be construed to be merely illustrative. Many other beneficial results can be attained by applying the disclosed invention in a different manner or by modifying the invention as will be described.

BRIEF DESCRIPTION OF THE DRAWINGS

For a more complete understanding of the present invention and the advantages thereof, reference is now made to the following descriptions taken in conjunction with the accompanying drawings, in which:

FIG. 1A illustrates a set of four interconnected networks and how BGP determines a route for packets flowing between Network A and Network D;

FIG. 1B illustrates how the destination server is unreachable when a problem exists with how Network A peers with Network D;

FIG. 2 illustrates how BGP routing decisions are insensitive to latency, which results in poor performance;

FIG. 3 illustrates a typical content delivery network edge server configuration;

FIG. 4 illustrates how the overlay mechanism of the present invention routes around down links to improve performance;

FIG. 5 illustrates how the overlay mechanism of the present invention finds a path with smallest latency to improve performance;

FIG. 6 illustrates a set of components that comprise an overlay mechanism, in accordance with an embodiment of the present invention;

FIG. 7 illustrates how data flows through the overlay mechanism of FIG. 6;

FIG. 8 illustrates how a gateway region may manage network address translation (NAT) in one embodiment;

FIG. 9 illustrates a sequence number translation function that may be carried out within a given gateway region;

FIG. 10 illustrates how different pieces of load information are reported on within the overlay;

FIG. 11 illustrates an alternate embodiment of the invention wherein a client behind a corporate firewall IP is mapped directly to a gateway region while other clients are mapped to public regions;

FIG. 12 is a process flow diagram that illustrates how a global traffic management (GTM) process may be implemented within the overlay in the alternate embodiment;

FIG. 13 illustrates normal overlay routing;

FIG. 14 illustrates a fail-safe operation for the overlay shown in FIG. 13;

FIG. 15 illustrates how the overlay mechanism is used to implement multi-client remote access to a given application on the target server; and

FIG. 16 illustrates how the overlay mechanism is used to implement site-to-site (remote office) access to a given application on the target server.

DETAILED DESCRIPTION OF AN ILLUSTRATIVE EMBODIMENT

The present invention is an “overlay” mechanism that sits on top of the publicly-routable Internet. Preferably, the overlay leverages existing content delivery network (CDN) infrastructure, although this is not a requirement. Generally, the overlay mechanism of the present invention provides performance enhancements for any application that uses IP as a transport protocol by routing around down links (FIG. 4) or finding a path with a smallest latency (FIG. 5). Thus, in FIG. 4, the link between Networks A and D is down, so the overlay routes the packets through Network B. In FIG. 5, the overlay uses the path through Network A to Network C to Network D, instead of the slower Network A to Network D path that BGP routing would dictate. In FIGS. 4 and 5, it is assumed that a client is attempting to communicate with a server. The illustration of a single or fixed client is not meant to be taken as limiting, however. This is a generalization. According to the present invention, the overlay mechanism may be used in many different operating environments. One such environment is where there are multiple clients (e.g., roaming clients) who desire to access a single server (or server farm). This is sometimes referred to as a “remote access” embodiment. In another scenario, two or more fixed endpoints desire to communicate with each other. This is sometimes referred to as a site-to-site or “remote office” embodiment. Generalizing, and as will be seen, the overlay mechanism operates by receiving IP packets at one set of servers, tunneling these packets through a series of zero or more CDN servers, and delivering them to a fixed, defined IP address.

The overlay IP (OIP) routing mechanism of the present invention comprises a representative set of components, as illustrated in FIG. 6:

- edge server 602: typically, a CDN edge server running an OIP edge server software process as described below. As will be described, this software is responsible for receiving, encapsulating and forwarding IP packets.
- edge region 600: typically, a CDN edge region configured for the overlay mechanism.
- intermediate server 606: typically, a server that receives encapsulated packets from an edge region 600 or other intermediate servers and forwards them on to other intermediate servers or to a gateway region.
- intermediate region 604: a region of intermediate servers.
- gateway server 610: typically, an edge server that has been configured to receive encapsulated packets from the overlay, and that applies source network address translation (NAT) to the original packets and forwards them onto the target server.
- gateway region 608: typically, a type of edge region comprising gateway servers and that is usually deployed on customer premises.
- target server 612: a machine whose traffic is to be tunneled through the overlay.
- target address: the IP address of the target server; this address is sometimes referred to as a direct address when being compared to a CDN virtual IP address.
- slot: a single “instance” of the overlay; preferably, a slot is a numbered index that corresponds to a single target address.
- virtual IP address: typically, a CDN address that corresponds to a slot; preferably, there is one virtual IP address per edge region per slot. It is sometimes referred to as a VIP.
- path 614: an ordered set of CDN regions between an edge region and a gateway region.
- path segment 616: a single hop of a path.
- tunnel 618: a set of one or more paths from an edge server to a gateway server.
- session 620: a single end-to-end connection from the client 622 to the target server; preferably, the session is defined by a five tuple (IP payload protocol, source address, destination address, source port, destination port). The source is the client and the destination is the target.

In the first embodiment, remote access, there are one or more clients that desire to send packets to a single IP address. FIG. 7 illustrates how the overlay IP mechanism achieves this operation. At step 1, the client 700 makes a DNS request to resolve a hostname. This hostname is aliased (e.g., by a CNAME) to a domain that is being managed by an authoritative DNS 702; typically, the authoritative DNS is managed by the CDN service provider. Preferably, this hostname corresponds to a single gateway region (and target address) 704. This is also referred to as a slot, as described above. At step 2, the DNS query returns a single IP address for the hostname. This address identifies a best performing available edge region 706 and, preferably, that region is dedicated to the hostname. The address is referred to as a virtual IP address, as described above. At step 3, the client 700 begins to send IP packets to the virtual IP address. These packets are received by a server in the edge region 706. The edge region 706 knows the gateway region 704 to which to send the packets based on the destination address in the IP packet header. The packet is then encapsulated. At step 4, and based on routes preferably provided by a CDN mapping system, the edge server in the edge region 706 sends out multiple copies of the encapsulated packets along multiple paths. One technique for performing this multiple path packet transport operation is described in U.S. Pat. Nos. 6,665,726 and 6,751,673, assigned to Akamai Technologies, Inc. As illustrated at step 5, several intermediate servers receive the encapsulated packets and forward them (either directly, or through other intermediate regions, not shown) to the gateway region 704, once again, preferably based on routes provided from the CDN mapping system. At step 6, the packets are received by a server in the gateway region 704, where duplicates are removed. Destination NAT translates the virtual IP to the target address and source Network Address Port Translation is applied to the packet before it is sent, so that the return traffic will also be sent over the overlay network. Preferably, information is stored so that return traffic is sent to the edge region 706 from which the client packet originated. At step 7, the gateway region 704 receives an IP packet from the target address and de-NATs the packet. The packet is then encapsulated. At step 8, multiple copies of the packet are sent along multiple paths. At step 9, the intermediate servers send the packets back to the original edge region for this session. At step 10, the packets are received by an edge server and duplicates are removed. The packet is sourced from the virtual IP address and then sent back to the requesting client 700. This completes the end-to-end transmission.

The following provides additional details for a representative edge server, intermediate server, and gateway server.

I. Edge Servers

The edge server runs a process (called oipd, for convenience) that provides the following functions: receives packets on virtual IP addresses; filters packets based on expected protocol/ports, invalid or attack traffic, or specified access control lists (ACLs); encapsulates/decapsulates; forwards duplicate packets; and receives duplicate packets. Each of these functions is now described.

Receive Packets

IP packets destined for a virtual IP address should not be handled by an edge server's TCP/IP stack; they must always be tunneled or dropped. Thus, the edge server operating system kernel includes a hook to intercept packets (and pass them to oipd) before they are handled locally. In a representative embodiment, the edge server runs commodity hardware and the Linux operating system kernel. The kernel includes modules ip_tables, ip_queue and iptable_filter. Upon machine startup, a configuration script sets rules (in the ip_tables module) such that all packets destined for any virtual IP address are delivered to user space. If no application is listening using the ip_queue module, the packets are dropped.

Packet Filtering

Preferably, three (3) types of packet filtering are supported: expected/allowed TCP/UDP; invalid or attack traffic; CDN service provider or customer-specified whitelist/blacklist ACLs. The edge server process oipd preferably filters TCP/UDP packets based on ports.

Encapsulation

An important function provided by the edge server process is encapsulation. An encapsulation header may contain the following information, or some portion thereof:

- protocol version: standard version field.
- TTL: the number of hops the packet can travel through before it is automatically sent to the gateway region (or edge region depending upon direction). It is decremented at each hop.
- path number: each path a packet is sent along will have an identifier. This number preferably corresponds to an index specified in the map or other data structure.
- forward state: used by the gateway server.
- data length: the number of bytes in a payload of an OIP packet.
- source service address: the encapsulating machine's service address.
- destination service address: the service address of the receiving side of the tunnel.
- message sequence number: used to identify duplicate packets and determine loss across a tunnel.
- OIP slot number: the slot for the packet.
- serial number: determined by hashing information of the IP packet.
- edge region number: the region number of the region sending the packet.
- SRMM middle region map rule: determines which intermediate regions can be used for this slot.
- message authentication code: used to determine authenticity of an OIP packet.

When a packet is to be encapsulated, the oipd process knows the CDN customer based on the virtual IP address in the destination address field in the IP header. The process then looks up a configuration file that contains the following map:

- Virtual IP address → SRIP slot number

This is sometimes referred to as a “VIP map.” Preferably, this map is determined at install time and the various edge server components have a consistent view of it. The slot number is then used to look up a gateway region number for this customer, preferably using a “Slot Configuration Map” that is generated by a slot configuration file. The edge server process then hashes information into a serial number to break up the load into manageable chunks for load balancing. Preferably, the hash contains at least the source IP address. In the case that the next header is TCP or UDP, the source port and destination port may be hashed as well.
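By way of illustration only, the lookup chain just described (virtual IP address to slot, slot to gateway region, and connection fields to serial number) might be sketched as follows. The map contents, the size of the serial-number space (`NUM_SERIALS`), the choice of SHA-1 as the hash, and the function names are all assumptions made for this sketch; the specification does not fix any of them.

```python
import hashlib
import ipaddress
import struct

# Hypothetical configuration data; the addresses and numbers are illustrative.
VIP_MAP = {"192.0.2.10": 7}        # virtual IP address -> slot number
SLOT_CONFIG_MAP = {7: 4021}        # slot number -> gateway region number
NUM_SERIALS = 1024                 # assumed size of the serial-number space


def serial_number(src_ip, src_port=None, dst_port=None):
    """Hash the source IP (and ports, when present) into a serial number."""
    data = ipaddress.ip_address(src_ip).packed
    if src_port is not None and dst_port is not None:
        # Ports are only available when the next header is TCP or UDP.
        data += struct.pack("!HH", src_port, dst_port)
    digest = hashlib.sha1(data).digest()
    return int.from_bytes(digest[:4], "big") % NUM_SERIALS


def classify(dst_vip, src_ip, src_port=None, dst_port=None):
    """Return (slot, gateway region, serial number) for an incoming packet."""
    slot = VIP_MAP[dst_vip]
    gateway_region = SLOT_CONFIG_MAP[slot]
    return slot, gateway_region, serial_number(src_ip, src_port, dst_port)
```

Because the serial number depends only on connection-identifying fields, every packet of a given connection maps to the same serial number, which is what allows the downstream maps to keep a connection on one service address.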

The edge server process subscribes to a first low level map (service B, as described below) that maps serial numbers to service addresses. In particular, the serial number is used to index the map to identify the gateway region that contains a single service address. This is the service address to which all (e.g., three (3)) copies of the packet will be sent. The edge server process then checks to see if it has forwarded any packets to this address in the past. If so, it increments a sequence number and uses this number in the “message sequence number” field in the encapsulation header. If the edge server process has not sent anything to this IP address yet, it initializes the state as described below.

Forward Packets

The “forward packets” function operates as follows. At this point, most of the encapsulation header is filled. The additional information for the header is generated as follows. Preferably, all edge servers subscribe to an assignment process (called SRMM and described below) that maps: MapperX.OIP.sr_assn_D_regionY where X is the correct Mapper prefix and Y is the region number. A destination region number and the SRMM middle region rule are used to index an assignment message. This yields one or more next hop region numbers, and preferably the number of next hop regions is configurable per slot. Each next hop is indexed with a path number. This path number is included in the encapsulation header so downstream intermediate nodes know their next hops. If an assignment message does not have any next regions, a single encapsulated packet is sent to the destination server address (and this should trigger an alert in the NOCC). For each of the next hop regions, a second low level map (service C, as described below) is checked. The serial number used to index this map preferably is the one included in the encapsulation packet header. Preferably, each of these maps contains only a single service address for each serial number. This is the service address to which the packet should be forwarded next. Preferably, the serial number is derived from connection identifying information (i.e., the five tuple). If the next region to be sent to is the gateway region, the first low level map does not need to be checked because this was already done by the sending edge server. Preferably, this initial lookup is done before a packet is sent along multiple paths to avoid having to synchronize intermediate regions.

With the header information finished, a MAC is computed for the header and the data. The computation may be based on SHA-1, MD5, or the like. To simplify the forwarding by intermediate nodes, the TTL field may be excluded from the hash, as this field is mutable.
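One possible way to compute such a code while leaving the TTL out of the computation is sketched below. The use of the HMAC construction, the shared key, and the position of the TTL byte within the header (`TTL_OFFSET`) are assumptions for illustration; the specification states only that the computation may be based on SHA-1, MD5, or the like.

```python
import hashlib
import hmac

TTL_OFFSET = 1  # assumed byte offset of the TTL field within the header


def compute_mac(key: bytes, header: bytes, payload: bytes) -> bytes:
    """Compute a MAC over header and payload with the mutable TTL masked."""
    masked = bytearray(header)
    masked[TTL_OFFSET] = 0  # zero the TTL so hop-by-hop decrements do not matter
    return hmac.new(key, bytes(masked) + payload, hashlib.sha1).digest()


def verify_mac(key: bytes, header: bytes, payload: bytes, mac: bytes) -> bool:
    """Check a received MAC using a constant-time comparison."""
    return hmac.compare_digest(compute_mac(key, header, payload), mac)
```

With the TTL masked in this manner, an intermediate node can decrement the TTL and forward the packet without recomputing the code, while a change to any other header byte or to the payload still causes verification to fail.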

Duplicate Packets

The edge server process preferably sends duplicate packets, as has been previously described. As noted above, preferably the edge server process sends multiple copies of each packet in several directions for redundancy. The receiving side needs an efficient way to filter out duplicates. When handling duplicate packets a first goal is to drop as few packets as possible; a next goal is to send as few duplicate packets as possible on to the target server. The edge server process tracks duplicate packets at the tunnel level. For every (edge source address, edge destination address) pair, the edge server process preferably maintains a sliding window of state indicating which packets have been received (and which have not) for every service address it receives a packet from. One processing algorithm is now described. The algorithm has one parameter, which is the size of the window, and two data objects to maintain: the window and the highest sequence number seen so far adjusted for wraparound. Preferably, the window size must be large enough so that it is not unnecessarily reset. So,

- oip.packet.window.size >= (max_packet_rate * max_packet_age)

On initialization, preferably, the entire window is initialized to NULL. The highest sequence number is set to the number of the first packet and the entry in the sliding window is set to SEEN. For every new packet, if the sequence number > highest sequence number + oip.packet.window.size, the state is initialized and started again. If the packet is within the window but less than the highest so far, the algorithm checks if the entry in the window has already been set to SEEN. If so, the packet is dropped; otherwise, it is marked as SEEN. If the packet is greater than the highest so far, the algorithm sets the highest to the packet, marks that entry as SEEN and all entries between as UNSEEN. Using three values requires two bits for state, although using two values is sufficient for correctness. If a sender restarts, there is a small chance that a randomly chosen sequence number will be in the current window the receiver is maintaining. This could cause packets to be dropped unnecessarily. To prevent this, preferably a sender periodically writes a map of (service address, last sequence number) for every service address to which it has sent data. When an edge server starts up, it reads this file and adds a large number to each stored sequence number to ensure it is not in the window. This value should be safely larger than the window size.
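The window algorithm described above may be sketched, for one tunnel, as follows. This is a minimal illustration rather than the actual oipd logic: the class and method names are hypothetical, sequence number wraparound is ignored, and packets older than the window are simply dropped, which is an assumption because the text does not address that case.

```python
NULL, UNSEEN, SEEN = 0, 1, 2


class DuplicateFilter:
    """Sliding-window duplicate detection for a single tunnel."""

    def __init__(self, window_size):
        self.size = window_size
        self.window = None
        self.highest = None

    def _reset(self, seq):
        # Initialize (or re-initialize) state around the given packet.
        self.window = [NULL] * self.size
        self.highest = seq
        self.window[seq % self.size] = SEEN

    def accept(self, seq):
        """Return True to forward the packet, False to drop it."""
        if self.highest is None or seq > self.highest + self.size:
            self._reset(seq)
            return True
        if seq > self.highest:
            # Mark the skipped sequence numbers as not yet received.
            for s in range(self.highest + 1, seq):
                self.window[s % self.size] = UNSEEN
            self.window[seq % self.size] = SEEN
            self.highest = seq
            return True
        if seq <= self.highest - self.size:
            return False  # assumption: older than the window, so drop
        idx = seq % self.size
        if self.window[idx] == SEEN:
            return False  # duplicate
        self.window[idx] = SEEN
        return True
```

For example, with a window size of 8, receiving sequence numbers 100, 102, 101 forwards all three packets, and a second copy of any of them is then dropped. A packet numbered far beyond the window (for instance 1000) resets the state, consistent with the restart handling described above.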

As noted above, preferably an edge region comprises a number of edge servers. Accordingly, the system may implement an edge server failover function. This ensures that if a single edge server fails, the number of packets that will be dropped as a result is minimized. Each edge region preferably has buddy groups of a given number of configured machines. For duplicate removal to work on failover, the SEEN packet state is published to all machines in the buddy group periodically. This data is likely to be changing very frequently. So, preferably each machine sends an update to all machines in its buddy group indicating the highest sequence number SEEN. Each machine sends this information over the backend network to all machines in its buddy group, preferably using TCP.

II. Intermediate Servers

Intermediate servers have a “forward packets” functionality that is similar to that implemented in an edge server. The intermediate servers subscribe to the same MapperX.OIP.sr_assn_D_regionY channel. Each intermediate region is assigned a SRMM middle region rule and should only receive packets for slots that are configured for that rule. A destination region number and the slot's middle region rule are used to index an assignment message to determine the next hop. If a machine receives packets for a slot that has a different middle region rule, it should continue to send them on, but it should trigger an alert. If sending to another intermediate region, the second low level map is used to determine the service address of the next hop server. If the next region is the gateway region, the destination service address in the header preferably is used. Before the packet is sent on, the TTL field must be decremented. If the TTL reaches one, the packet is forwarded to the gateway region service address.
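The next-hop decision at an intermediate server can be illustrated with the following sketch. The dictionary-based packet and map representations, the field names, and the `next_hop` function are hypothetical; in particular, it is assumed here that the destination (gateway) region number is available to the forwarding node alongside the header fields.

```python
def next_hop(pkt, assignments, service_c_maps, my_rule, alerts):
    """Return the service address to which an encapsulated packet is sent next.

    pkt            -- dict with ttl, rule, dst_region, path, serial,
                      dst_service_addr (hypothetical field names)
    assignments    -- (dst_region, rule) -> list of next hop regions,
                      indexed by path number
    service_c_maps -- region -> {serial number -> service address}
    """
    if pkt["rule"] != my_rule:
        # Wrong middle region rule: keep forwarding, but raise an alert.
        alerts.append(("rule-mismatch", pkt["rule"], my_rule))

    pkt["ttl"] -= 1
    if pkt["ttl"] <= 1:
        # TTL exhausted: go straight to the gateway region service address.
        return pkt["dst_service_addr"]

    region = assignments[(pkt["dst_region"], pkt["rule"])][pkt["path"]]
    if region == pkt["dst_region"]:
        # The sending edge server already resolved the gateway address.
        return pkt["dst_service_addr"]
    return service_c_maps[region][pkt["serial"]]
```

In this sketch the path number carried in the header selects among the next hop regions in the assignment, so each copy of a packet continues along its own path.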

III. Gateway Servers

As mentioned above, preferably the gateway region is a special edge region that is generally located at the customer's data center. Ideally, this is somewhere close to the target server. The gateway server also runs an instance of the edge server process oipd, which can act simultaneously as an edge and gateway. Each machine in the gateway region preferably has its own CDN-specific IP address used for installs and secure access, and as a service address used for overlay network communication. Preferably, and as described above, a single VIP address is used for fixed client mapping and several NAT addresses are used. The gateway region preferably contains several machines for redundancy. These machines may be load balanced.

The gateway server provides the following functions: connection tracking, state synchronization, network address translation, sequence number translation, in-region packet forwarding, returning packets to the edge region. Each of these functions is now described in detail.

Connection Tracking

To track connections, the edge server process oipd makes use of a NAT library such as libalias, which performs masquerading and IP address translation. This library preferably stores connection tracking information in a hash table or other data structure. When a new connection is established, a new entry is added to the table. When an existing entry is referenced, its timestamp is updated. When an existing entry has not been referenced in a given time period (the meaning of which varies based on protocol and connection state), the entry is deleted from the table. When a new connection track is created, in addition to being added to the standard libalias database, preferably it is also added to a list of newly-created but not yet synchronized connection tracks. When an existing connection track is modified, it is added to a list of modified but not yet synchronized entries, unless it is already in that list or in the new entries list. When an existing connection track is deleted, it is removed from the libalias database, and it is added to a list of deleted entries (unless it is currently in the new entries list, which means that information about it was never synchronized and so the deletion information does not need to be synchronized). The oipd process will periodically synchronize updates to the database (the new, modified, and deleted lists). This allows the connection tracking database to be shared across an entire gateway region so that per-connection state information can move between machines in the gateway region.
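The bookkeeping of the new, modified, and deleted lists may be sketched as follows. This is a simplified stand-in and does not use or reproduce the libalias API: the `ConnTracker` class, its method names, and the single fixed timeout are hypothetical. Removing a deleted entry from the modified list is also an assumption, since the text does not address that case.

```python
class ConnTracker:
    """Connection table with pending-synchronization lists (illustrative)."""

    def __init__(self):
        self.table = {}        # connection key -> last-referenced timestamp
        self.new = set()       # created, not yet synchronized
        self.modified = set()  # changed, not yet synchronized
        self.deleted = set()   # removed, deletion not yet synchronized

    def create(self, key, now):
        self.table[key] = now
        self.new.add(key)

    def touch(self, key, now):
        self.table[key] = now
        if key not in self.new:  # new entries are synchronized in full anyway
            self.modified.add(key)

    def delete(self, key):
        del self.table[key]
        self.modified.discard(key)  # assumption: no update for a dead entry
        if key in self.new:
            self.new.discard(key)   # peers never learned of it; nothing to send
        else:
            self.deleted.add(key)

    def expire(self, now, timeout):
        for key in [k for k, ts in self.table.items() if now - ts > timeout]:
            self.delete(key)

    def drain(self):
        """Return and clear the three pending lists for a periodic sync."""
        batch = (sorted(self.new), sorted(self.modified), sorted(self.deleted))
        self.new.clear()
        self.modified.clear()
        self.deleted.clear()
        return batch
```

A periodic task would call `drain()` and send the three lists to the other machines in the gateway region. An entry that is created and deleted between two synchronizations never appears in any list, which matches the behavior described above.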

State Synchronization

The oipd process running in the gateway server preferably associates several pieces of information with a single connection. This information is synchronized across all machines in a gateway region. To this end, a series of functions is added to libalias to facilitate the gathering of data to be synchronized. There is one function for synchronization of each type of list (new, mod, delete), one for gathering all records to initialize a peer machine, and one for responding to a query for a single entry. These functions are now described.

A function GetSyncAddData builds the data packet for synchronization of all new records that have not yet been synchronized. In the case of TCP, only those connections that are considered fully connected (i.e., the SYN packet has been seen from both the client and the server) will be synchronized. Preferably, the oipd process ensures that a single connection will always be handled by the same machine until it is marked as fully connected. This ensures that potential race conditions related to synchronization of partially connected tracking entries and their expiration times can be avoided.

A function GetSyncModData builds the data packet for synchronization of modified records that have not yet been synchronized. It is desirable that an active entry be synchronized at least often enough to ensure that it does not time out in the database on a remote machine. Preferably, the oipd process ensures that every entry in the list is synchronized at least periodically (e.g., once per timeout period for a UDP connection entry). This ensures that connection entries do not incorrectly time out while at the same time limiting the bandwidth required to keep the gateway region synchronized.

A function GetSyncDelData gathers the information for deleted records and a function GetSyncAllData gathers the information for all entries in the database.

When synchronization data is received by a remote machine, it is passed to a function SetSyncAddData if it applies to active connections (i.e., it was gathered by GetSyncAddData, GetSyncModData, or GetSyncAllData). This method creates a new connection entry in the local database or updates an existing entry, if there is one. For TCP connections, the state of the connection is tracked by libalias, preferably using two finite state machines, one for each direction (in/out). The function SetSyncAddData ensures that the finite state machine on the local machine follows valid transitions so that an active connection is not incorrectly marked as not connected or disconnected. Synchronization data relating to records to be deleted is passed to a function SetSyncDelData, which removes entries from the local table as long as the local table's timestamp for the entry is not more recent than that of the deleted entry.
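The receive-side rules can be sketched as follows. The function names set_sync_add and set_sync_del are hypothetical Python analogues of SetSyncAddData and SetSyncDelData, the table is reduced to a dictionary of timestamps, and the TCP state machine checks are omitted. Keeping the later of the two timestamps on an update is an assumption; the text only says that an existing entry is updated.

```python
def set_sync_add(table, key, remote_ts):
    """Create the entry, or update it; the later timestamp is kept (assumed)."""
    table[key] = max(table.get(key, remote_ts), remote_ts)

def set_sync_del(table, key, deleted_ts):
    """Remove the entry unless the local copy was referenced more recently.

    Returns True if the entry was removed.
    """
    local_ts = table.get(key)
    if local_ts is not None and local_ts <= deleted_ts:
        del table[key]
        return True
    return False
```

The timestamp comparison protects a connection that is still active locally from a deletion notice that was generated from older information on a peer.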

Preferably, these libalias synchronization routines are used by a NAT_sync module, which contains a sender thread and a receiver thread. This module is initialized with the frontend and backend addresses of all its synchronization peers (e.g., all other machines in the gateway region). The sender thread connects to one of its peers to retrieve a full snapshot of the connection tracking table as part of its initialization procedure. It tries each peer in turn until it finds one that is responsive; it then requests a table update from that machine. Preferably, no other overlay activities proceed until this initialization is complete or all peer nodes have been tried. The sender thread preferably uses a real-time clock to drive synchronization. For every clock iteration, the sender thread uses GetSyncAddData, GetSyncDelData, and GetSyncModData to collect data for synchronization. This data is then sent to all live peers in the region. To check for liveness, the sender thread preferably attempts to establish a TCP connection to a peer node using the last known good address (either the frontend or the backend). If that connection fails, then the sender thread attempts to establish a connection over the other address. If that fails, then the peer is assumed to be dead, although it may be tried again on subsequent synchronization attempts. The rate at which the sender's synchronization clock iterates is set in libalias for use by the GetSyncModData algorithm to determine how many modification records to send in each synchronization period. The rate is configurable. As noted above, on each clock iteration, all new connection entries (except half-connected TCP), a reasonable number of modification entries, and all deletion entries will be synchronized. The receiver thread simply waits for remote connections to come in, and then processes them according to message type. Update and deletion messages are passed to the appropriate SetSync method. Initialization requests cause the receiver to run a function GetSyncAllData, and then to send the data to the remote machine.
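The liveness check performed by the sender thread can be sketched in Python as shown here. The Peer class, the tcp_probe helper, and the one-second timeout are illustrative assumptions and are not taken from the NAT_sync module itself.

```python
import socket

def tcp_probe(addr, timeout=1.0):
    """Return True if a TCP connection to (host, port) can be opened."""
    try:
        with socket.create_connection(addr, timeout=timeout):
            return True
    except OSError:
        return False

class Peer:
    """Hypothetical record for one synchronization peer."""

    def __init__(self, frontend, backend):
        self.addrs = [frontend, backend]
        self.last_good = 0  # index of the last address that answered

    def check_alive(self, probe=tcp_probe):
        """Try the last known good address first, then the other one."""
        for idx in (self.last_good, 1 - self.last_good):
            if probe(self.addrs[idx]):
                self.last_good = idx
                return True
        return False  # assumed dead for this round; retried on a later round
```

Passing the probe as a parameter is a design choice made for this sketch so that the fallback order can be exercised without opening real sockets.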

Network Address Translation

As described above, the overlay performs source network address translation (NAT) for all packets arriving on a service address (the gateway) that are to be sent to the target server. This ensures that the packets will be sent back through the overlay. Addresses for source NAT preferably are defined on a per-region and per-machine basis. Preferably, given machines are not assigned a static NAT address, as it may be desirable to move around NAT addresses. Also, when a packet arrives at a gateway server, preferably the NAT is applied differently depending on the type of connection involved. These connection types are configurable, e.g., per service or per port: TCP, long-lived; TCP, short-lived; UDP—session; UDP—query/response; and ICMP. Query/response UDP connections typically only involve a single UDP packet from client to server and zero or more response packets from server to client. For UDP query/response and ICMP, preferably a new NAT address, port/ICMP id is chosen for each incoming packet. Preferably, each machine in the gateway region is given its own port/ICMP id range that only it is allowed to use to create new NAT sessions. Upon receiving a packet, the edge server process (described in more detail below) chooses a free address, port/ICMP id and sends the packet to the target server. The machine that owns the NAT address receives response packets from the server, but typically it will not have NAT session information to translate the addresses back. Thus, the machine preferably checks the destination port/ICMP id and forwards the response to the machine that owns that port/ICMP id. This second machine then translates the packet back to contain the original client address and port/ICMP id and sends it back over the overlay. This operation is illustrated in FIG. 8. For both types of TCP connections, preferably a new NAT address and port are chosen only when a SYN packet is seen. In either case, the NAT source port number is used to direct packets back to the server that created the session. For long-lived connections, it is desirable to reduce the amount of data that needs to be sent between gateway servers; preferably, this is accomplished by synchronizing the NAT session information. In particular, when a data packet arrives, an edge server process (as described below) determines if any connection state data exists. If so, the process applies the source NAT and sends out the packet. If connection state data is not available, the edge server process forwards the packet to an owning machine. For UDP connections that involve multiple packets from the client, a machine will check if it has session information for the client address and port. If so, the machine uses that NAT information. If not, the machine preferably acquires a region-wide lock for this client address, port. This ensures that if packets arrive simultaneously on two gateway servers, only one of them will establish the NAT session.
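The use of per-machine port/ICMP id ranges to route a response to the machine that created the session can be sketched as follows. The function names, the tuple return values, and the choice to drop a response whose port falls in no range are assumptions made for illustration.

```python
def port_owner(port, ranges):
    """ranges: machine -> (first_port, last_port), inclusive and disjoint."""
    for machine, (lo, hi) in ranges.items():
        if lo <= port <= hi:
            return machine
    return None

def handle_response(local, dst_port, local_sessions, ranges):
    """Decide what the receiving gateway machine does with a server response."""
    if dst_port in local_sessions:
        return ("translate", local)   # session state is held locally
    owner = port_owner(dst_port, ranges)
    if owner is None or owner == local:
        return ("drop", None)         # assumed handling; not from the text
    return ("forward", owner)         # the owner translates and returns it
```

For example, with ranges {"gw1": (20000, 29999), "gw2": (30000, 39999)}, a response arriving at gw1 for port 31000 with no local session is forwarded to gw2.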

Sequence Number Translation

When the oipd process in the gateway server receives a SYN packet from the target server to the client, it performs a sequence number translation to tag the packet, indicating its host id in the packet. Whenever this process receives a packet from the overlay, if it does not have the state information for the packet, it looks at the sequence number to determine the gateway server to which it should forward the packet. Preferably, each host is assigned a unique identifier. Each gateway region is assigned a maximum size, and the sequence number space preferably is broken up into 2*max size pieces. To support SYN cookies, preferably only the highest eight bits are modified and each host is assigned two consecutive blocks of sequence number space. When the first packet arriving at a gateway region is handled by one server and a subsequent packet is handled by a different server before the state is synchronized, the second server preferably uses the high eight bits in the ACK sequence number to determine the server to which it should send the packet. This is illustrated in FIG. 9. Before sending a packet to the target server, the oipd process must unapply the sequence number translation so that the target server gets the correct ACK sequence number.
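One possible reading of this scheme is sketched below in Python. The text does not give the exact bit arithmetic, so everything here is an assumption: the 256 possible values of the high byte are split into 2*max_size equal pieces, host i owns pieces 2i and 2i+1, and the per-connection offset (delta) needed to undo the translation is kept with the connection state on the owning host.

```python
MASK32 = 0xFFFFFFFF

def block_width(max_size):
    """High-byte values per piece when 256 values are split 2*max_size ways."""
    return 256 // (2 * max_size)

def tag(seq, host_id, max_size):
    """Rewrite the high eight bits so they fall in host_id's first block.

    Returns (translated_seq, delta); delta is stored with the connection.
    """
    width = block_width(max_size)
    new_top = 2 * host_id * width + (seq >> 24) % width
    translated = (new_top << 24) | (seq & 0x00FFFFFF)
    return translated, (translated - seq) & MASK32

def untag(seq, delta):
    """Undo the translation before the packet goes to the target server."""
    return (seq - delta) & MASK32

def owner(seq, max_size):
    """Host id encoded in the high eight bits of a translated number."""
    return (seq >> 24) // (2 * block_width(max_size))
```

A gateway server without state for a connection would call owner() on the ACK number and forward the packet to that host, which holds the delta and can call untag().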

In-Region Forwarding

Before forwarding a packet to another gateway server, a forward state in a header field is modified to indicate that the packet has already been forwarded. The receiving machine handles this packet as if it were received from the overlay, but if the oipd process needs to forward the packet again for any reason, it will instead drop the packet.

Returning Packets to the Edge Region

Preferably, a packet is sent from the target server back through the overlay to the same edge region that sent it originally. This ensures that the edge server there will be able to send out an IP packet with the source address of the Virtual IP to which the client originally sent a packet. When a packet is received at the gateway region, the encapsulated IP packet contains the virtual IP that the client used and the header will contain the edge region number. This information is stored with the connection and synchronized in the same way as the NAT session information. When a gateway server receives a packet from the target server, it replaces the source IP address (which contains the target address) with the virtual IP stored. It then forwards the packet on to the correct region.

As in the edge server regions, preferably a gateway region implements a buddy group of servers. This group is configured for a region, and each server therein includes a monitoring process to check for liveness and load. As necessary, machine or software failover can then be implemented.

Load Reporting

The overlay mechanism preferably implements load balancing. To this end, preferably each server executes an in-machine DNS process that performs several load balancing functions: mapping Virtual IP addresses and NAT IP addresses to live edge servers, mapping serial numbers to live edge servers, and mapping serial numbers to live intermediate servers. These functions require three (3) distinct services: service B, service C, and service D. Service B represents the bytes entering the overlay, whereas service C represents the bytes exiting the overlay. FIG. 10 illustrates how different pieces of load information are reported on the different services.

Service B maps both virtual IP addresses and NAT addresses onto machines. At install time, a one-to-one mapping from slots to virtual addresses and NAT slots to NAT addresses is established. To ensure that all regions can support the same number of VIP slots, a network wide configuration variable is set, leaving a given set of serial numbers in a service B map for NAT addresses. NAT slots begin with a slot number and work downwards. For service B, the in-machine DNS process needs to know liveness of the edge server, capacity of the machine in bytes, and a byte load per slot. A monitoring process running on the machine is used to determine liveness. The oipd process reports the byte load in and out of the server to the monitoring process. For service C, the in-machine DNS process needs to know the liveness of the edge server, a capacity of the machine (according to a given metric), and a value of that metric per serial number. A representative metric is a “flit.” A “flit” is an arbitrary unit of work generally representing non-bandwidth resource usage on a given server machine. Such utilization typically encompasses CPU utilization, disk utilization, operating system abstraction-limited resources such as threads and semaphores, and the like, and combinations thereof. In a representative embodiment, a flit is a given linear or convex function of several individual machine variables, such as CPU and disk utilizations. For the load balancing described generally below, however, the number of bytes entering the server is a good approximation for the flit value. For service D, the in-machine DNS process needs to know the liveness of the intermediate server, a capacity of the machine (according to the given metric), and a value of that metric per serial number.

Region Assignment

A region assignment process executes in the overlay mechanism's authoritative DNS. Because every connection mapped to a region has corresponding traffic to that region from client→server and server→client, both services B and C need to be taken into account. To this end, a region monitor process reports a sum of bytes in and out on both services B and C. The region assignment process needs to know an edge region flit capacity metric, and an edge region byte load and flit load (for services B and C). Each connection mapped to a region affects the amount of load the region sends and receives because packets preferably are sent to the edge region from which they originated. To handle this properly, the region assignment process preferably uses the maximum of bytes in, bytes out to determine how loaded a region is.
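The load rule in the last sentence can be written as a short Python sketch. The function names are hypothetical, and comparing the load against a per-region byte capacity is a simplifying assumption; the text refers to a flit capacity metric.

```python
def region_load(bytes_in, bytes_out):
    """Load figure for a region: the larger of the two traffic directions."""
    return max(bytes_in, bytes_out)

def pick_region(regions):
    """regions: name -> (bytes_in, bytes_out, capacity); lowest utilization wins."""
    return min(regions, key=lambda r: region_load(*regions[r][:2]) / regions[r][2])
```

With regions {"east": (40, 90, 100), "west": (70, 20, 100)}, the east region is treated as 90% loaded because of its outbound traffic, so pick_region selects "west".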

Map Maker

The assignment process (SRMM) uses two pieces of information: bytes per destination region (same as slot), edge region pair; and region flit load for intermediate regions. An oipd process on the edge reports byte load that will be injected into the overlay, preferably broken down by destination region. This is the demand that is put on the overlay and is the value sent in a new message from the region monitor to the SRMM process. The intermediate regions report total service D bytes and flits. Because flit load correlates to byte load, the SRMM process preferably uses the edge bytes as demand and the flit cap and a byte/flit ratio to infer a reasonable byte cap in the middle regions. This information goes into an existing region assignment message to which SRMM then subscribes.

The SRMM process generates region level maps that specify the best paths from an edge region to a gateway region, as well as the best paths from the gateway region back to the edge region. As noted above, preferably SRMM receives demand input from edge regions and load and capacity information from the middle regions. In addition, each intermediate region is assigned to one or more middle region map rules and each slot will be assigned to a particular middle region map rule. Preferably, SRMM only maps traffic for a particular slot through intermediate regions that have the same map rule.

In operation, SRMM determines multiple paths from every edge region to every gateway region while ensuring that no two paths share the same intermediate region and that all intermediate regions are properly load balanced. By default, it will choose three paths. To handle these requirements, preferably the following simplifying assumptions are made for the SRMM algorithm: the first path chosen is optimal (i.e., it can support multi-hop paths) given information from a ping subsystem, provided there is sufficient capacity; and the next two paths contain only a single intermediate region.

The path determination is broken down into two parts corresponding to the assumptions above. The first part involves running a sequence of shortest path algorithms. For each middle map rule, Dijkstra's algorithm is run for each destination, where destination is all edge and gateway regions. The edge byte load is used as demand in these calculations and a running total is tracked for each intermediate region. If the total load is under capacity for each intermediate region, this part of the path determination is done. If it is not, all nodes that are over capacity will have a price adjustment, and the algorithm is repeated. The price adjustment is applied by multiplying the scores of each of the links that use the overflowing node by a constant factor. This process is repeated until stable. Then, the capacities at each node are updated.

In essence, for the second part, a list of (source, destination) pairs of (edge region, gateway region) is determined. For each one, the algorithm chooses the best intermediate node with sufficient capacity for that pair. The output of the algorithm creates an IP assignment message for service D per region. It is indexed by (destination region, path number).
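The first part of the path determination can be illustrated with the following Python sketch. It is a simplified reading and not the SRMM implementation: Dijkstra's algorithm is run once per (source, destination) pair instead of once per destination, a single map rule is assumed, and the names assign_paths, factor, and max_rounds are hypothetical. The round limit is an added safeguard in case the price adjustment does not settle.

```python
import heapq

def shortest_path(links, src, dst):
    """Plain Dijkstra; links maps node -> {neighbor: score}."""
    dist, prev, heap = {src: 0.0}, {}, [(0.0, src)]
    while heap:
        d, node = heapq.heappop(heap)
        if node == dst:
            break
        if d > dist.get(node, float("inf")):
            continue  # stale heap entry
        for nxt, score in links.get(node, {}).items():
            nd = d + score
            if nd < dist.get(nxt, float("inf")):
                dist[nxt], prev[nxt] = nd, node
                heapq.heappush(heap, (nd, nxt))
    if dst not in dist:
        return None
    path = [dst]
    while path[-1] != src:
        path.append(prev[path[-1]])
    return path[::-1]

def assign_paths(links, demand, capacity, factor=2.0, max_rounds=20):
    """demand: (src, dst) -> bytes; capacity: intermediate node -> bytes."""
    links = {u: dict(v) for u, v in links.items()}  # work on a copy
    for _ in range(max_rounds):
        paths = {pair: shortest_path(links, *pair) for pair in demand}
        load = dict.fromkeys(capacity, 0)
        for pair, path in paths.items():
            for node in (path or [])[1:-1]:
                if node in load:
                    load[node] += demand[pair]
        over = {n for n in capacity if load[n] > capacity[n]}
        if not over:
            break
        # Price adjustment: scale every link that touches an overflowing node.
        for u, nbrs in links.items():
            for v in nbrs:
                if u in over or v in over:
                    nbrs[v] *= factor
    return paths, load
```

If the round limit is reached, the function returns the last paths computed even though some node may still be over capacity; a fuller version would report that condition.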

Edge Mapping

As noted above, to use the service offering, customers CNAME a hostname to the CDN service provider's DNS, which then becomes authoritative for that domain. If desired, the hostname may in turn be configured to a global traffic management hostname to provide additional advantages.

By way of background, the overlay mechanism may use “per-customer” regions. A per-customer region supports traffic for only a single slot. Such an approach is desirable for users behind a corporate firewall. In particular, such users can then be mapped to a CDN region located near the target server that will keep LAN performance from degrading into WAN performance by going to a public CDN edge region. In FIG. 11, the client behind a corporate firewall IP is mapped directly to the gateway region, while other clients are mapped to public regions. Using global traffic management, the users with specific, predefined source IP addresses are mapped to these “per-customer” regions.

FIG. 12 is a process flow diagram that illustrates how the global traffic management (GTM) process may be implemented within the overlay to provide this functionality. Where information is not available at a given decision point, preferably the “No” path is taken. Thus, if the GTM does not have any predefined mapping data, the client is mapped to a public region. Likewise, if the GTM does not have any geo information, the client is mapped to a public region. If the GTM does not have liveness feedback information, it assumes the region is down and maps the client using the direct IP address. The GTM preferably obtains liveness information from the gateway region by running a ping test to the virtual IP in that region.

Fail-Safe Mapping

By way of background, a customer's target server address may be considered a fail-safe IP address. If the overlay detects any end-to-end problems (either customer-specific or otherwise), preferably the CDN's authoritative DNS hands out this IP address. This has the effect of removing the overlay from all communications. FIG. 13 illustrates normal overlay routing, and FIG. 14 illustrates the fail-safe operation. The fail-safe operation may be triggered as a result of various occurrences, e.g., any per-customer per-region problems, where multiple regions are down or out of capacity, where the GTM does not have liveness information from testing the virtual IP at a gateway region, where one or more name server processes do not have sufficient information to make assignments, or the like.

Management Portal

The CDN service provider may provide a customer facing extranet that allows customers to log-in and perform necessary routing configuration as well as view performance and traffic utilization statistics. Routing configuration includes entering BGP information so the edge servers may establish a BGP session with the local router and the application IP information for the applications that will use the overlay network.

The overlay mechanism of the present invention provides numerous advantages. Preferably, the mechanism continuously monitors and analyzes performance from each local edge node to the intermediate nodes, from each local edge node to each remote edge node, and between all intermediate nodes. Thus, for example, these performance measurements include latency and packet loss. This data is then combined and analyzed to determine optimal paths between two edge locations. If the overlay network is able to provide a better path than the BGP-determined path, the edge nodes will then intercept packets and route them via intermediate nodes. In addition to routing around poor performing paths, the overlay also uses packet replication to ensure uptime. The performance measurement granularity of the overlay is such that an outage could otherwise disrupt the transmission of packets. To avoid this problem, packets are replicated at the edge and sent via multiple overlay paths. One of the paths may be the directly routed path, but packets are intercepted and encapsulated at the sending edge and de-encapsulated at the remote edge. Preferably, the first packet received at the remote end is used and the others are dropped. In an alternative embodiment, a fail-safe IP address (typically the target server's IP address) is handed out by the CDN authoritative DNS under certain conditions.

Remote Access Embodiment

In the remote access embodiment illustrated in FIG. 15, multiple clients 1502 a-n are sending packets to a single IP address, the target server 1504. The overlay mechanism 1500 comprises the authoritative DNS 1506, at least one edge region 1508, one or more intermediate regions 1510, and one or more gateway regions 1512. An application of interest is executable in part on the target server. A client side of the application executes on the client machine, which may be a laptop, a mobile computing device, or the like. The particular functionality of the application (or how the application is implemented in a distributed manner across the client and server) is transparent to the overlay. It is merely assumed that communications between client and server occur over an IP transport. In this embodiment, the application is associated with an Internet domain. As has been previously described, that domain is aliased (via a CNAME) to an overlay network domain being managed by the service provider. A DNS query to the application domain causes the authoritative DNS 1506 to return a VIP address. A given client (whose machine typically runs a client side application or instance) then connects to the application on the target server through the overlay mechanism as has been previously described. In particular, data packets destined for the application are encapsulated at the edge, duplicated, forwarded over multiple paths, and then processed at the gateway to remove duplicates. At the gateway, destination NAT translates the virtual IP to the target address and source Network Address Port Translation is applied to the packet before it is sent, so that the return traffic will also be sent over the overlay network. Preferably, information is stored so that return traffic is sent to the edge region from which the client packet originated. When the application responds, the gateway region receives an IP packet from the target address and de-NATs the packet. The packet is then encapsulated. Multiple copies of the packet are then sent along multiple paths. The intermediate servers send the packets back to the original edge region for this session. The packets are received by an edge server and duplicates are removed. The packet is sourced from the virtual IP address and then sent back to the requesting client.
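The duplicate-and-deduplicate behavior at the two ends of the overlay can be sketched in Python as follows. The text does not say how copies of the same packet are recognized, so the per-packet identifier, the dictionary used as an encapsulation header, and the bounded window of remembered identifiers are all assumptions.

```python
from collections import OrderedDict

def replicate(packet_id, payload, paths):
    """Sending side: one encapsulated copy per overlay path."""
    return [{"id": packet_id, "path": p, "payload": payload} for p in paths]

class Deduplicator:
    """Receiving side: deliver the first copy of each id, drop the rest."""

    def __init__(self, window=1024):
        self.seen = OrderedDict()
        self.window = window

    def accept(self, copy):
        if copy["id"] in self.seen:
            return None  # a copy already arrived over another path
        self.seen[copy["id"]] = True
        if len(self.seen) > self.window:
            self.seen.popitem(last=False)  # forget the oldest id
        return copy["payload"]
```

Whichever path delivers a copy first determines the latency seen by the application; the later copies only cost bandwidth.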

Remote Office Embodiment

In this embodiment, as illustrated in FIG. 16, an application executes on a target server 1604 and the client machine 1602 (which itself may be an application or application instance) is on a fixed, remote endpoint. In a representative example, the client 1602 and server 1604 are located at a pair of geographically-distributed sites. For ease of discussion, it is assumed that the client machine 1602 is located outside an enterprise firewall but adjacent a router 1608. In this embodiment, the client machine 1602 executes remote office acceleration (ROA) process 1610. The process 1610 advertises itself as the target server IP address; thus, router 1608 sees that process as a router. By advertising the target server IP address, the process 1610 establishes a BGP session with the router 1608 and thereby transparently intercepts data packets intended for the application located at the remote office. Once intercepted, the process 1610 performs the encapsulation and other functions performed by the edge server in the remote access embodiment. The overlay mechanism provides the packet communications between the two sites in the manner previously described.

In the remote office embodiment, the overlay provides a method of data encapsulation by degenerate use of the BGP protocol. Generalizing, this technique can be used for any given Internet protocol including EGRP, OSPF, and the like. While the degenerate BGP approach is preferred, packets may also be provided to the overlay using an in-line approach (e.g., a packet grabber).

Variants:

To both save client bandwidth and reduce service provider bandwidth cost, it may be desirable to implement a dynamic decision as to what degree to replicate packets. Thus, for example, it may be desired to start with a given number (e.g., three (3) copies) and then reduce this number dynamically if the loss/latency is acceptable.

Generalizing, the overlay mechanism as described above may be generalized as a routing “cloud” in which the arbitrary data flows are implemented intelligently. Within the cloud, and using the above-described transport techniques, the data may pass through one or more intermediate nodes, be transmitted along redundant paths, employ TCP or UDP as a transport mechanism, obey flow or class specific logic for determining optimality in a multidimensional parameter space, or any combination thereof.

Based on the architecture and billing model employed, a local controller or “knob machine” may be employed at the overlay ingress points. The motivation for this element is that bandwidth flowing to the cloud and within the cloud has an inherent cost. In some cases (e.g., all VPN traffic is being handled) it would not make sense for all of this traffic to flow to the cloud because at minimum the cost basis for the solution is twice the alternative. Thus, the purpose of the controller is to provide a simple and effective means of supplying the system with cost/benefit tradeoff business logic. (As such, the box is effectively a knob controlling how aggressively the system is used—hence the name.) Based on the rules supplied, the controller makes a decision on a per packet or per flow basis as to whether the traffic should be sent through the cloud or directly to the other end point. By default, preferably all traffic would flow through the controller. However, the controller would be configurable so as to employ appropriate business logic to decide when a flow should be sent to the cloud. This information may also affect the behavior that occurs within the cloud.

The rules that could be employed by the controller are quite flexible. They include selecting traffic based on: domain (e.g., Intranet traffic is important but a given content provider is not), IP Address (e.g., important to route traffic to the Tokyo office but not necessarily Chicago), performance predictions (e.g., the controller can understand various QoS metrics about the direct and alternate paths using the parent servers and choose to direct the traffic if the predicted improvement is greater than a given threshold and/or if the direct path's quality is below a certain threshold), reliability predictions (e.g., the controller can understand various reliability metrics about the direct and alternate paths and choose to use an alternate path through a parent server for the traffic if the predicted improvement is greater than a given threshold and/or if the direct path's quality is below a certain threshold), or the like. Of course, these rules are more powerful when used together. For example, one could choose to set the performance and reliability metrics differently for different domains.
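A per-flow decision that combines a business rule with a performance prediction might look like the following Python sketch. The Rule fields, the use of latency in milliseconds as the only quality metric, and the "or" combination of the two thresholds are assumptions chosen for illustration.

```python
from dataclasses import dataclass

@dataclass
class Rule:
    domain: str                # domain the rule applies to
    important: bool            # business rule: is this traffic worth the cost?
    min_improvement: float     # required fractional latency improvement
    max_direct_latency: float  # ms; above this the direct path counts as poor

def use_cloud(rule, direct_ms, overlay_ms):
    """Return True if the flow should be sent through the cloud."""
    if not rule.important:
        return False
    improvement = (direct_ms - overlay_ms) / direct_ms
    return (improvement > rule.min_improvement
            or direct_ms > rule.max_direct_latency)
```

Setting different thresholds in the Rule for different domains corresponds to the combined use of rules noted above.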

Thus, path optimization is the general idea of getting a data packet from one node to another via a CDN region based on performance data that has been collected over time. This technique, as described above, has been used to improve connectivity back to a customer origin server for HTTP traffic. Moreover, the service feature can be tuned so that it is invoked preferably when it can improve performance enough (e.g., latency reduced by x %). According to the present invention, the content delivery network is viewed as a “cloud” into which data flows may enter, get optimally directed across the “middle” (e.g., by the optimization technique), and appear on the other side as well as possible on their way to their destination. For example, consider the case of VPN traffic coming out of an office on the East Coast (e.g., Cambridge) on its way to an office in California (e.g., in San Mateo). Instead of just using BGP, the present invention may direct this traffic into the network at an “entry point.” This traffic then flows through the network, taking the appropriate number of bounces, and then exits the network at an “exit point” near the appropriate office. In a particular implementation, a controller is placed at the Cambridge office through which all traffic flows. This controller is used to decide, in real-time, whether traffic should be sent through the cloud or directly to the other end point, based on both business rules (e.g., traffic to a first domain is important but traffic to a second domain is not) and quality-of-service (QoS) rules (e.g., use the alternate path only if there is no direct path or if the alternate path is some % better). Of course, these examples do not limit the present invention.

Although not meant to be limiting, the controller machine can be implemented in an edge or ISP network. Several potential applications are now described.

Web Transactions

Entities that provide secure transactions on the Web may use the present invention to improve the reliability and performance of those transactions. One example would be a provider of online credit card transactions. Transactions are appealing both because they are by definition not cacheable and, moreover, because each request is valuable. The present invention can facilitate Web transactions by simply adding HTTPS support to the edge server optimization technique. In this context, CDN regions that support secure content (e.g., via SSL) can be viewed as “entry points” to the HTTPS routing cloud, with the connections to the origin or through the network being provided accordingly. In essence, the HTTP edge server implementation is extended into the generalized cloud, where every edge region can serve as an entry point and where a certain set of regions (e.g., parent regions) can serve as exit points.

Another area of applicability is the fact that both the secure and HTTP clouds support all SOAP-based transactions inasmuch as SOAP is implemented in HTTP. Thus, the present invention can be used to implement SOAP-based Web transactions.

VPN QoS

Another opportunity for valuable flows lies in the realm of VPNs. VPN traffic has the two attractive properties of being associated with information that the business finds valuable and also by definition flowing between two geographically disparate locations. The simple idea is to see VPN traffic as an encrypted IP data stream with end-to-end semantics behind the firewalls. Thus, it is possible to provide a UDP-based transport mechanism through the cloud. Most likely, the traffic would flow transparently through a knob machine and be directed into the cloud appropriately. In this fashion, all traffic on the VPN will benefit from the service based on the knob settings.

This technique also is compatible with enterprise content delivery network offerings. With an enterprise CDN and the cloud, the CDN service provider can improve the performance and reliability of every piece of data in the company if the company has a VPN-based architecture. For example, all VoD content and cacheable web content may be cached in the customer's office. However, live streaming cannot be cached. The enterprise CDN can rate limit the traffic to protect finite resources. The cloud provides the redundancy and retransmits to improve the quality of these streams. In addition, other data flows, such as P2P videoconferencing, database synchronization, and the like, can benefit from this functionality.

Web Service Networks

As web services become more widely accepted, one of the key weaknesses of the systems will be the messaging that occurs between them. While there are several protocols being developed to support this functionality (SOAP, UDDI, etc.), there are two missing ingredients. The first is a set of application layer functionality, such as security, queuing, and non-repudiation/logging. The second is a transport mechanism for these messages.

One of ordinary skill will appreciate that the above-described overlay technologies are advantageous as they determine the entire outbound and inbound routes for a given communication using the BGP routing layer as a black box, and hence are not susceptible to BGP-related issues.

One of ordinary skill in the art will recognize that the present invention facilitates Layer 3 general purpose routing of IP traffic within a distributed, potentially global computer network for increased performance and reliability, perhaps dependent on application-specific metrics. The system will accept arbitrary IP flows and will route them through the content delivery network to attempt to achieve a set of metrics, e.g., metrics established by metadata. The metadata could be applied in a number of ways based on such properties as the source, destination, flow, application type, or the like. To achieve these metrics, the system also could decide to route the traffic on a per-hop and per-packet basis along an alternate route and/or multiple routes.
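By way of illustration only, the metadata-driven choice among multiple routes can be sketched as follows. The sketch assumes that each candidate path carries measured latency and loss figures, that per-flow metadata supplies hypothetical weights for those metrics, and that the selected paths should share no intermediate node. The names `Path`, `score`, and `select_paths`, and the greedy selection itself, are assumptions made for this example and are not a definitive implementation.

```python
from dataclasses import dataclass
from typing import Dict, List, Set, Tuple


@dataclass(frozen=True)
class Path:
    hops: Tuple[str, ...]  # intermediate nodes between edge and gateway
    latency_ms: float
    loss_rate: float


def score(path: Path, weights: Dict[str, float]) -> float:
    """Lower is better; weights stand in for per-flow metadata."""
    return (weights.get("latency", 0.0) * path.latency_ms
            + weights.get("loss", 0.0) * path.loss_rate)


def select_paths(candidates: List[Path],
                 weights: Dict[str, float],
                 copies: int) -> List[Path]:
    """Greedily pick up to `copies` best paths sharing no intermediate node."""
    chosen: List[Path] = []
    used: Set[str] = set()
    for path in sorted(candidates, key=lambda p: score(p, weights)):
        if len(chosen) == copies:
            break
        if used.isdisjoint(path.hops):
            chosen.append(path)
            used.update(path.hops)
    return chosen
```

A latency-sensitive flow might weight latency heavily, while a bulk transfer might weight loss; the same candidate set then yields different routes for different application types.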

Having described our invention, what we now claim follows below.

1. An overlay network that provides a plurality of client machines remote access to an application executing on a target server, wherein each client machine communicates with the application over the Internet using Internet Protocol (IP) transport, comprising: a domain name service that is authoritative for a hostname associated with the application; a first server, a set of second servers, and a third server, wherein each server in the overlay network receives and processes communications over IP, the first server having a virtual IP address determined by resolution of the hostname associated with the application; wherein, for each IP-based request data packet to be communicated between a client machine and the application executing on the target server, the first server encapsulates the request data packet, duplicates the request data packet as encapsulated, and forwards the request data packet as encapsulated to the third server, the request data packet being forwarded to the third server over each of a set of paths that include at least one second server, the set of paths including at least first and second paths from the first server to the third server that do not share a same second server, the first server subsequently receiving a response to the request data packet at the virtual IP address; and wherein the third server processes received data to recover the request data packet, applies a network address translation to the request data packet as recovered, and forwards the request data packet to the target server for further processing by the application, wherein the network address translation also applies a source NAT to the packet before it is forwarded to the target server.

2. The overlay network as described in claim 1 wherein the third server receives the response from the target server as a response data packet, encapsulates the response data packet, duplicates the response data packet as encapsulated, and forwards the response data packet as encapsulated over a set of paths.

3. The overlay network method as described in claim 2 wherein the first server processes received data to recover the response data packet, and forwards the response data packet to the client machine.